[ADAM-883] Add caching to Transform pipeline. #884

fnothaft · 2015-11-18T22:38:14Z

The Transform pipeline in the CLI has several stages (e.g., sort, indel
realignment, BQSR) that trigger recomputation. If you are running a single
stage off of local storage/HDFS/Tachyon, this is OK. However, if you're running
multiple stages, or you are loading data from S3/etc, this can lead to serious
performance degradation. To address this, I've added the proper caching
statements. Additionally, I've added a hook so that the user can specify the
storage level to use for caching. Resolves #883.
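
For readers unfamiliar with the pattern, here is a minimal sketch of what optional caching with a user-specified storage level can look like in Spark. It is not the code from this PR: the `CachingSketch` object, the `maybeCache` helper, and the toy integer data set are hypothetical stand-ins, and only `StorageLevel.fromString`, `persist`, and `unpersist` are actual Spark calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Hypothetical helper (not from the PR): persist `rdd` only when caching
  // is requested, parsing the level name with StorageLevel.fromString.
  def maybeCache[T](rdd: RDD[T], cache: Boolean, storageLevel: String): RDD[T] = {
    if (cache) rdd.persist(StorageLevel.fromString(storageLevel)) else rdd
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for reads loaded from slow storage such as S3.
      val reads = sc.parallelize(1 to 1000).map(_ * 2)
      val cached = maybeCache(reads, cache = true, storageLevel = "MEMORY_AND_DISK")

      // Two downstream actions reuse the persisted data instead of
      // re-running the upstream map for each one.
      val sortedCount = cached.sortBy(x => x).count()
      val total = cached.map(_.toLong).reduce(_ + _)
      println(s"count=$sortedCount sum=$total level=${cached.getStorageLevel}")

      // Release the cached blocks once the pipeline is done with them.
      cached.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Passing the level as a string keeps the command line simple: any name accepted by `StorageLevel.fromString` (for example `MEMORY_ONLY` or `MEMORY_AND_DISK_SER`) can be forwarded as-is, and an unrecognized name fails fast with an `IllegalArgumentException`.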

AmplabJenkins · 2015-11-18T23:06:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1016/

heuermh · 2015-11-19T01:37:26Z

adam-cli/src/main/scala/org/bdgenomics/adam/cli/Transform.scala

@@ -105,6 +106,10 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var mdTagsFragmentSize: Long = 1000000L
  @Args4jOption(required = false, name = "-md_tag_overwrite", usage = "When adding MD tags to reads, overwrite existing incorrect tags.")
  var mdTagsOverwrite: Boolean = false
+  @Args4jOption(required = false, name = "-cache", usage = "Cache data to avoid recomputing between stages.")
+  var cache: Boolean = false
+  @Args4jOption(required = false, name = "-storageLevel", usage = "Set the storage level to use for caching.")


-storageLevel → -storage_level?

fnothaft · 2015-11-19

Good catch; will fix. Thanks!

fnothaft · 2015-11-19T03:02:06Z

Fixed nit and rebased.

AmplabJenkins · 2015-11-19T03:18:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1018/

heuermh · 2015-11-19T04:46:01Z

Thanks!

heuermh reviewed Nov 19, 2015
fnothaft force-pushed the caching branch from c7414ac to 9962d5f Compare November 19, 2015 03:01

heuermh added a commit that referenced this pull request Nov 19, 2015

Merge pull request #884 from fnothaft/caching

5845b15

[ADAM-883] Add caching to Transform pipeline.

heuermh merged commit 5845b15 into bigdatagenomics:master Nov 19, 2015
